In the rapidly evolving landscape of financial technology, credit scoring remains a cornerstone in determining the creditworthiness of applicants. This paper delves into the realm of predictive modeling, using a dataset provided by our professors, enriched with comprehensive credit scoring information. The overarching objective of our study is to train a model that efficiently and accurately determines the eligibility of individuals for credit approval.

The task set forth by our professors involves a comprehensive journey through four key exercises. The first exercise is centered around a thorough analysis of the dataset. Here, we delve into understanding the underlying structures, identifying critical variables, and most importantly, balancing the dataset to ensure a fair representation of diverse credit scenarios.

Subsequently, in the second exercise, we focus on training and testing a logistic classifier. This step is crucial in establishing a baseline model from which further improvements can be measured.

The third exercise involves enhancing the predictive performance of the model. This stage is particularly challenging and significant, as it entails fine-tuning the model to ensure it captures the nuances of credit scoring with greater accuracy and efficiency.

Finally, the fourth exercise invites us to step into the practical world. Here, we explore the various challenges a company may face if they were to implement our model in a real-world scenario. This exercise not only grounds our theoretical work in reality but also provides valuable insights into the practical implications and considerations in deploying machine learning models in the credit industry.

This paper aims to not only present a robust model for credit scoring but also to contribute to the broader understanding of how machine learning can be effectively utilized in financial decision-making processes.

Load packages and data

## (package start-up and conflict messages omitted; loaded packages:
## tidyverse, reshape2, plotly, caret, DescTools, pROC, ROCR, ROSE, Boruta,
## RColorBrewer, ggcorrplot, PerformanceAnalytics, xts, zoo, corrr,
## networkD3, reshape, knitr, dlookr, corrplot)
## Rows: 40000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): grade, home_ownership, verification_status, purpose, application_type
## dbl (12): loan_amnt, int_rate, annual_inc, dti, open_acc, revol_bal, revol_u...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exercise 1

1.1 Describe the data. Specifically:

1.1.1: Check and report the structure of the data set.

str(loan_sample)
## spc_tbl_ [40,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ loan_amnt          : num [1:40000] 10000 16000 6000 9000 24000 8000 3600 5550 2000 10000 ...
##  $ int_rate           : num [1:40000] 11.47 9.49 17.99 9.75 12.59 ...
##  $ grade              : chr [1:40000] "B" "B" "D" "B" ...
##  $ home_ownership     : chr [1:40000] "RENT" "MORTGAGE" "RENT" "MORTGAGE" ...
##  $ annual_inc         : num [1:40000] 35000 110000 40000 54000 66000 ...
##  $ verification_status: chr [1:40000] "Verified" "Not Verified" "Verified" "Verified" ...
##  $ purpose            : chr [1:40000] "debt_consolidation" "debt_consolidation" "home_improvement" "car" ...
##  $ dti                : num [1:40000] 22.05 15.67 17.34 3.58 12.25 ...
##  $ open_acc           : num [1:40000] 15 9 8 11 8 10 8 5 8 14 ...
##  $ revol_bal          : num [1:40000] 10211 10068 3755 1459 29656 ...
##  $ revol_util         : num [1:40000] 31.5 32.7 25.5 22.8 70.6 34.3 36.1 66.1 24.6 75.6 ...
##  $ total_acc          : num [1:40000] 17 21 19 34 21 17 9 12 30 21 ...
##  $ total_rec_int      : num [1:40000] 1574 2364 943 436 2778 ...
##  $ application_type   : chr [1:40000] "Individual" "Individual" "Individual" "Joint App" ...
##  $ tot_cur_bal        : num [1:40000] 17440 35104 19995 8713 29656 ...
##  $ total_rev_hi_lim   : num [1:40000] 32400 30800 14700 6400 42000 28200 21400 6200 17500 30300 ...
##  $ Status             : num [1:40000] 0 0 0 0 0 0 1 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   loan_amnt = col_double(),
##   ..   int_rate = col_double(),
##   ..   grade = col_character(),
##   ..   home_ownership = col_character(),
##   ..   annual_inc = col_double(),
##   ..   verification_status = col_character(),
##   ..   purpose = col_character(),
##   ..   dti = col_double(),
##   ..   open_acc = col_double(),
##   ..   revol_bal = col_double(),
##   ..   revol_util = col_double(),
##   ..   total_acc = col_double(),
##   ..   total_rec_int = col_double(),
##   ..   application_type = col_character(),
##   ..   tot_cur_bal = col_double(),
##   ..   total_rev_hi_lim = col_double(),
##   ..   Status = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Checking for NAs
any(is.na(loan_sample))
## [1] FALSE

The result “FALSE” indicates that there are no missing values (NAs) in the “loan_sample” dataset, so we can proceed without any NA cleaning.

1.1.2: How many numeric and how many categorical variables are included in the data? What categorical variable has the most levels in it?

# First we have to check whether all variables have been imported in the correct format. 
first_overview <- overview(loan_sample)
plot(first_overview)

The columns of type character still have to be converted to factors. In addition, the “Status” column is currently numeric; since it is a binary variable, it should also be converted to a factor. These conversions give the data a consistent structure for the subsequent analysis and modelling.

loan_sample <- loan_sample %>%
  mutate_if(is.character, as.factor) %>% # Convert all character columns to factors
  mutate(Status = as.factor(Status)) # Convert 'Status' column to factor

# Plot to see, if the datatypes are correct
second_overview <- overview(loan_sample)
plot(second_overview)

The data types now look correct and the bar chart shows no anomalies, so we can proceed with the converted data and count the number of variables.

# Count numeric variables in 'loan_sample'
numeric_vars_count <- sum(sapply(loan_sample, is.numeric))

# Count categorical (factor) variables in 'loan_sample'
categorical_vars_count <- sum(sapply(loan_sample, is.factor))

# Print counts of numeric and categorical variables
cat("There are", numeric_vars_count, "numeric variables and", categorical_vars_count, "categorical variables in the dataset.")
## There are 11 numeric variables and 6 categorical variables in the dataset.
# Number of distinct levels for each categorical variable
categorical_levels <- sapply(loan_sample, function(x) if(is.factor(x)) length(unique(x)) else NA)

# Find the name of the categorical variable with the most levels
cat_var_most_levels <- names(which.max(categorical_levels))

# Find the number of levels for this variable
levels_count <- max(categorical_levels, na.rm = TRUE)

# Print the variable name and the number of levels
cat("The categorical variable with the most levels in the dataset is:", cat_var_most_levels, "with", levels_count, "levels.")
## The categorical variable with the most levels in the dataset is: purpose with 13 levels.

1.1.3: Summarize the variables. Discuss the summary statistics obtained.

# Summary statistics for all variables
summary(loan_sample)
##    loan_amnt        int_rate     grade      home_ownership    annual_inc    
##  Min.   : 1000   Min.   : 5.31   A: 7329   MORTGAGE:17701   Min.   :  5000  
##  1st Qu.: 7000   1st Qu.: 9.44   B:13166   OWN     : 3981   1st Qu.: 42000  
##  Median :10075   Median :12.29   C:11842   RENT    :18318   Median : 57000  
##  Mean   :11687   Mean   :12.62   D: 7663                    Mean   : 63400  
##  3rd Qu.:15000   3rd Qu.:15.05                              3rd Qu.: 77000  
##  Max.   :40000   Max.   :27.49                              Max.   :400000  
##                                                                             
##       verification_status               purpose           dti       
##  Not Verified   :14373    debt_consolidation:23342   Min.   : 0.00  
##  Source Verified:16116    credit_card       : 9362   1st Qu.:12.13  
##  Verified       : 9511    other             : 2337   Median :17.60  
##                           home_improvement  : 2099   Mean   :18.22  
##                           major_purchase    :  794   3rd Qu.:23.86  
##                           medical           :  444   Max.   :60.14  
##                           (Other)           : 1622                  
##     open_acc       revol_bal       revol_util       total_acc    
##  Min.   : 1.00   Min.   :    0   Min.   :  0.00   Min.   : 3.00  
##  1st Qu.: 8.00   1st Qu.: 5615   1st Qu.: 34.80   1st Qu.:15.00  
##  Median :10.00   Median : 9818   Median : 52.40   Median :20.00  
##  Mean   :10.31   Mean   :11995   Mean   : 52.17   Mean   :21.27  
##  3rd Qu.:13.00   3rd Qu.:15832   3rd Qu.: 70.00   3rd Qu.:27.00  
##  Max.   :23.00   Max.   :78762   Max.   :121.40   Max.   :57.00  
##                                                                  
##  total_rec_int      application_type  tot_cur_bal     total_rev_hi_lim
##  Min.   :   0.0   Individual:39450   Min.   :     0   Min.   :   300  
##  1st Qu.: 673.2   Joint App :  550   1st Qu.: 25136   1st Qu.: 13000  
##  Median :1342.5                      Median : 53722   Median : 20900  
##  Mean   :1818.5                      Mean   : 99100   Mean   : 24194  
##  3rd Qu.:2432.9                      3rd Qu.:157834   3rd Qu.: 32200  
##  Max.   :8834.9                      Max.   :472573   Max.   :100000  
##                                                                       
##  Status   
##  0:34810  
##  1: 5190  
##           
##           
##           
##           
## 

Our dataset covers 40,000 loans. The average loan amount requested is about $11,687, with a wide range from $1,000 to $40,000, reflecting very different borrowing needs.

Interest rates average 12.62% but vary considerably, from 5.31% at the low end to 27.49% at the high end, so the terms a borrower obtains can differ substantially.

On the income side, the average annual income of applicants is $63,400, again with a broad spread that reflects the diversity of applicants' financial situations.

We also examined the debt-to-income ratio (DTI), which measures how much of a borrower's income is committed to servicing debt. It averages 18.22% and is a useful indicator of how comfortably applicants can manage their debt.
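To make the DTI definition concrete, here is a small illustrative calculation. The figures are hypothetical, and the common definition of monthly debt payments over gross monthly income is assumed:

```r
# Hypothetical DTI computation: monthly debt payments / gross monthly income
monthly_debt   <- 950    # illustrative monthly debt payments
monthly_income <- 5200   # illustrative gross monthly income
dti <- monthly_debt / monthly_income * 100
round(dti, 2)  # about 18.27, close to the sample average of 18.22
```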

Turning to credit lines, applicants hold on average about 10 open credit lines (open_acc, with a maximum of 23), while the average total number of credit lines ever held (total_acc) is about 21. This gives insight into applicants' credit history.

The average revolving balance is $11,995, and applicants use on average about 52% of their available revolving credit; they draw on their credit without maxing it out.

The average interest paid to date is about $1,818 per loan, on top of the principal. Looking at the bigger picture, the average total current balance across all accounts (tot_cur_bal) is $99,100, indicating a substantial overall debt load.

Finally, the average total revolving credit limit (total_rev_hi_lim) is $24,194, the maximum amount applicants are allowed to borrow on their revolving accounts.

Taken together, these statistics describe both individual financial situations and the broader borrowing patterns in the sample.

1.1.4: Check the levels of the target variable by choosing the appropriate visualization. Is the target variable balanced?

# Plot the distribution of the target variable
ggplot(loan_sample, aes(x = Status, fill = Status)) + 
  geom_bar() +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = 'Distribution of target variable (Status)', x = 'Status', y = 'Count') +
  theme_minimal()

The ‘Status’ variable in our dataset serves as a crucial indicator of credit approval outcomes. It distinctly classifies these outcomes into two levels: ‘0’ represents a ‘No,’ indicating that the credit has not been approved, while ‘1’ signifies a ‘Yes,’ confirming the approval of the credit.

Upon closer inspection, our dataset reveals a notable imbalance in the distribution of these outcomes. Specifically, there is a significantly higher number of instances where credits were not approved (Status ‘0’) compared to those that were approved (Status ‘1’). This imbalance is a critical aspect to consider, especially in the context of machine learning model development.
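As a quick quantitative check, the class counts from the summary output above (34,810 vs. 5,190) can be expressed as proportions; the numbers below are copied from that summary rather than recomputed from the raw file:

```r
# Class counts for Status, taken from summary(loan_sample) above
counts <- c(`0` = 34810, `1` = 5190)
prop.table(counts)  # share of each class: about 87% vs. 13%
```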

In machine learning, an unbalanced dataset can introduce bias into the predictive model. In our case, the imbalance may lead the model to overemphasize the majority class (non-approved credits), potentially resulting in a classifier that tends to predict a ‘No’ more frequently.

To mitigate this issue, it becomes essential to address the class imbalance before training our predictive model. Balancing the dataset involves techniques such as oversampling the minority class (approved credits), undersampling the majority class (non-approved credits), or using more advanced methods like synthetic data generation. By doing so, we aim to create a more equitable representation of both outcomes in our training data.
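As one possible sketch of the undersampling option, the snippet below balances a binary target using only base R. The data frame `df` is a toy stand-in for loan_sample, not the assignment data:

```r
# Sketch: balance a binary target by undersampling the majority class
set.seed(42)
df <- data.frame(
  Status = factor(c(rep(0, 900), rep(1, 100))),  # imbalanced toy target
  x      = rnorm(1000)
)
minority_n <- min(table(df$Status))  # size of the smaller class
balanced <- do.call(rbind, lapply(split(df, df$Status), function(g) {
  g[sample(nrow(g), minority_n), , drop = FALSE]  # draw minority_n rows per class
}))
table(balanced$Status)  # both classes now have 100 rows
```

Oversampling works analogously by sampling the minority class with replacement up to the majority size; synthetic approaches such as ROSE generate new minority examples instead.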

1.1.5: Check the distribution of the numeric variables in the data set (include different visual representations).

# Histograms
loan_sample %>% 
  select_if(is.numeric) %>% 
  gather(key = "variable", value = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~variable, scales = 'free_x') +
  theme_minimal()

The histogram we created for the variable open_acc has gaps. We should examine this more closely.

# Histogram of the variable open_acc
ggplot(loan_sample, aes(x = open_acc)) +
  geom_histogram(bins = 22) + # Adjust the number of bins as necessary
  labs(x = "Open Accounts", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Since open_acc only takes integer values between 1 and 23, using 30 bins leaves empty bins. Matching the number of bins to the actual value range yields a clearer, gap-free histogram.

# Boxplots
loan_sample %>% 
  select_if(is.numeric) %>% 
  gather(key = "variable", value = "value") %>%
  ggplot(aes(y = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = 'free') +
  theme_minimal()

1.2: Investigating outliers with a box plot

Elaborate your view on how to proceed in dealing with the outliers and – if necessary – take appropriate action.

# Reshape the data to long format and scale the numeric values
loan_sample_long_scaled <- loan_sample %>%
  select(where(is.numeric)) %>%
  mutate(across(everything(), scale)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

# Create a boxplot with the scaled values
ggplot(loan_sample_long_scaled, aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(x = "", y = "Scaled Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

The box plot effectively captures the distribution of the scaled numeric variables, with outliers appearing as individual points beyond the whiskers. Notably, several variables exhibit outliers, indicating the potential presence of extreme values in our dataset.

To gain a deeper understanding, it is essential to contextualize these outliers. Are they genuine extreme values reflective of the population, or do they stem from errors in data collection or measurement discrepancies? This distinction is crucial in determining the appropriate course of action.

Understanding the impact of outliers on our analysis is paramount. Outliers can significantly influence statistical measures, potentially skewing measures of central tendency and affecting the overall spread of the data. For instance, they may impact the accuracy of mean values and standard deviations.
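A tiny sketch of that sensitivity (toy numbers, not taken from the dataset):

```r
# A single extreme value dominates the mean but barely moves the median
v <- c(1:9, 1000)
mean(v)    # 104.5 -- pulled up by the outlier
median(v)  # 5.5   -- robust to it
```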

A closer look at the box plot reveals specific variables that seem to have more pronounced outliers. Exploring these outliers in the context of the variables they pertain to could reveal valuable insights and guide subsequent steps in the analysis.

Various statistical techniques, such as calculating z-scores or leveraging the interquartile range (IQR), can systematically identify outliers. Implementing these techniques allows for a more rigorous and objective assessment of outlier presence.
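A minimal sketch of the IQR rule (the 1.5×IQR fences that box plots use; the helper name is our own, not from any package):

```r
# Flag values outside the 1.5*IQR fences (the standard boxplot rule)
flag_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}
v <- c(1:10, 100)
which(flag_outliers_iqr(v))  # only the 11th value (100) is flagged
```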

When deciding how to address outliers, removal or correction is a common approach. However, alternative strategies exist, including transforming the data, winsorizing (capping extreme values), or stratifying analyses based on the presence of outliers. Each strategy comes with its considerations and potential implications for the overall analysis.

In summary, recognizing and addressing outliers is a nuanced process that requires careful consideration of their nature, impact, and appropriate handling strategies to ensure the integrity of our analysis.
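The IQR rule mentioned above can be sketched in a few lines. This runs on synthetic data rather than the loan sample, and the 1.5 multiplier is the conventional default, not a value taken from our analysis:

```r
# Illustrative IQR rule on synthetic data (not the loan dataset):
# values beyond 1.5 * IQR from the quartiles are flagged as outliers.
set.seed(7)
x <- c(rnorm(100), 10, -8)                 # two artificial extremes
q <- quantile(x, c(0.25, 0.75))
iqr_bounds <- c(q[1] - 1.5 * (q[2] - q[1]),
                q[2] + 1.5 * (q[2] - q[1]))
outliers <- x[x < iqr_bounds[1] | x > iqr_bounds[2]]
length(outliers)                           # at least the two planted extremes
```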

# Diagnosing outliers and selecting variables with more than 1% outliers
variables_with_high_outlier_ratio <- loan_sample %>%
  diagnose_outlier() %>%
  filter(outliers_ratio > 1) %>%
  pull(variables)

# Plotting the outliers for these variables
loan_sample %>%
  plot_outlier(one_of(variables_with_high_outlier_ratio))

As we worked on refining our model, we initially thought removing more outliers would make it better, so we decided to keep 95% of our data and discard the top 5% as outliers. Surprisingly, this move didn’t improve the model; in fact, it made it less precise.

Realizing that less might be more in this case, we changed our approach. Instead of removing 5%, we opted for a more cautious strategy, cutting out only 1% of outliers. This way, we’re holding onto 99% of our data, hoping to strike a balance between cleaning up noisy data and keeping enough information for the model to learn effectively.

This back-and-forth shows that finding the right balance in handling outliers is a bit of a puzzle, and we’re trying to crack it to make our model as accurate and reliable as possible.

# Function for adjusting the values to the quantile limits
cap_values_at_quantiles <- function(x) {
    quantiles <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
    x <- ifelse(x < quantiles[1], quantiles[1], x)
    x <- ifelse(x > quantiles[2], quantiles[2], x)
    x
}

loan_sample_truncated <- loan_sample %>%
  mutate(across(where(is.numeric), cap_values_at_quantiles))
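A quick sanity check that the capping behaves as intended, run on an illustrative vector (the helper is repeated here so the chunk is self-contained):

```r
# Sanity check of the capping helper on an illustrative vector
cap_values_at_quantiles <- function(x) {
    quantiles <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
    x <- ifelse(x < quantiles[1], quantiles[1], x)
    x <- ifelse(x > quantiles[2], quantiles[2], x)
    x
}
set.seed(7)
v <- c(rnorm(1000), 50)            # one extreme value
capped <- cap_values_at_quantiles(v)
range(capped)                      # both ends now sit at the 1%/99% quantiles
```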

1.3: Create a new balanced data set where the two levels of the target variable will be equally represented; Create a bar plot of the newly created target variable. Why is this step necessary?

# Undersample the majority class so both levels of Status are equally represented
loan_sample_balanced <- ovun.sample(Status ~ ., data = loan_sample_truncated, method = "under")
loan_sample_under <- data.frame(loan_sample_balanced[["data"]])

# Plot the balance
ggplot(loan_sample_under, aes(x = Status, fill = Status, group = Status)) +
  geom_bar() +
  ylab("Count") +
  xlab("Status of the loan") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  ggtitle("New balanced dataset")

Balancing the dataset is key for training models, especially when dealing with imbalanced classes like predicting loan default. This ensures a fair representation of both outcomes, preventing the model from favoring the more common class. By striking this balance, we aim to improve the model’s accuracy and fairness in predicting both common and less frequent events, contributing to a more reliable and equitable performance.
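To see why this matters, consider the raw class counts (34,810 non-defaults vs 5,190 defaults, as reported later in Exercise 3): a trivial model that always predicts the majority class would look deceptively accurate.

```r
# The accuracy paradox: on the original 87/13 split, always predicting
# the majority class scores ~87% accuracy while identifying zero defaults.
status <- c(rep(0, 34810), rep(1, 5190))   # class counts from the raw data
pred   <- rep(0, length(status))           # trivial majority-class "model"
mean(pred == status)
```

Balancing removes this shortcut, forcing the classifier to actually learn the minority class.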

1.4: Investigating the distribution of the numeric features per the two levels of our target feature

(i.e. default vs non-default). Discuss the visualizations. Which variables seem to be relevant in predicting the target feature?

# Define columns to exclude (these are the character/factor columns)
exclude_columns <- c(3, 4, 6, 7, 14)

# Loop through columns, excluding the specified ones
for (i in setdiff(1:ncol(loan_sample_under), exclude_columns)) {
  # Only plot numeric columns
  if (is.numeric(loan_sample_under[[i]])) {
    print(ggplot(loan_sample_under, aes_string(y = names(loan_sample_under)[i], color = "Status")) +
            geom_boxplot() +
            ylab(names(loan_sample_under[i])) +
            theme(axis.title.x = element_blank(),
                  axis.text.x = element_blank(),
                  axis.ticks.x = element_blank()))
  }
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

While looking at the boxplots, it’s clear that some variables don’t show much difference between the ‘Status’ ‘0’ and ‘1’ categories. Notably, interest rate, annual income, total revenue, and DTI stand out with significant differences between the two statuses. However, manually creating each boxplot can be time-consuming, and interpreting them might lack a systematic approach.

To tackle this, we decided on a more comprehensive method by combining manual boxplot analysis with insights from the Boruta algorithm. This hybrid approach blends both manual examination and algorithmic assistance, aiming for a nuanced understanding of the data and highlighting potential correlations. By doing this, we aim for a well-rounded outcome, drawing on the strengths of both manual and automated analyses to gain a more robust understanding of the dataset’s relationships.

This hybrid model allows for a more informed interpretation, ensuring that the observed deviations in specific variables are thoroughly considered within the broader dataset context. It represents a holistic and balanced approach, bringing together human expertise and algorithmic rigor to deepen our understanding of the correlation patterns in our data.

1.5: Use a bar plot visualization to investigate the associations between the categorical variables and the target feature.

# Generate a bar plot for each categorical variable
categorical_vars <- c("grade", "home_ownership", "verification_status", "purpose", "application_type")
# Loop over categorical variables and plot
for (cat_var in categorical_vars) {
  # Check if the column exists to avoid errors
  if (!cat_var %in% names(loan_sample_under)) {
    message(paste("Skipping", cat_var, "as it is not found in the dataset."))
    next
  }
  
  # Create the plot
  plot <- loan_sample_under %>%
    group_by(.data[[cat_var]], Status) %>%
    summarise(Count = n(), .groups = 'drop') %>%
    ggplot(aes(x = .data[[cat_var]], y = Count, fill = as.factor(Status))) +
    geom_bar(stat = "identity", position = position_dodge()) +
    labs(title = paste("Bar plot of", cat_var, "by Status"), x = cat_var, y = "Count") +
    scale_fill_discrete(name = "Status") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x labels for readability
   # Print the plot
  print(plot)
}

For this analysis, we made use of a loop to keep the code tidy. A noteworthy observation is that in stage D, the ratio of ‘1’ to ‘0’ is nearly twice as high, leading to a noticeable difference. On the flip side, in stage A, we see the opposite trend. Moving on to the Home Ownership plot, it becomes apparent that individuals who rent a house or apartment have higher chances of being creditworthy compared to those with a mortgage, where the odds are notably lower. Interestingly, no significant influence is observed for individuals who already own property.

In the subsequent graph, the significance of having a verified source of income on creditworthiness is emphasized. It also matters whether only the source of income is verified or whether the exact figures are confirmed as well. This aligns logically with the idea that such verification substantially reduces the bank's risk, since the reported income figures are more trustworthy.

Debt consolidation stands out with a high occurrence. We speculate that the data might be from the USA or other countries, as we believe such a trend might be more of an exclusion criterion in Switzerland. Nevertheless, debt consolidation can also make economic sense, allowing individuals to pay lower interest rates on existing loans. The criterion of owning a credit card is remarkably high, indicating their common use. However, it doesn’t seem to have a significant impact on the approval of credit applications.

The Application Type is intriguing, revealing that a lot more people opt for an individual loan, while very few consider a joint application. Surprisingly, though, it doesn’t play a substantial role in influencing credit approval.

1.6: Boruta algorithm

In the next step, we run the Boruta algorithm. The Boruta algorithm provides a more objective method for determining the importance of features, as it is based on repeated, random comparisons rather than human judgment.

##  1. run of importance source...
##  2. run of importance source...
## ...
## After 11 iterations, +26 secs:
##  confirmed 12 attributes: annual_inc, dti, grade, home_ownership, int_rate and 7 more;
##  rejected 1 attribute: application_type;
##  still have 3 attributes left.
## ...
## After 15 iterations, +35 secs:
##  confirmed 2 attributes: open_acc, verification_status;
##  still have 1 attribute left.
## ... (runs 16-99 of the importance source elided)
##  [1] "loan_amnt"           "int_rate"            "grade"              
##  [4] "home_ownership"      "annual_inc"          "verification_status"
##  [7] "purpose"             "dti"                 "open_acc"           
## [10] "revol_bal"           "revol_util"          "total_acc"          
## [13] "total_rec_int"       "tot_cur_bal"         "total_rev_hi_lim"

With the Boruta method providing valuable insights, the next step in our analysis involves a closer examination of multicollinearity. Multicollinearity arises when one predictor variable in a multiple regression model can be accurately predicted from the others. This phenomenon can lead to skewed or misleading coefficient estimates and draw potentially erroneous conclusions.

In essence, the presence of multicollinearity introduces a level of redundancy among predictor variables, making it challenging to isolate their individual effects on the response variable. When high correlations exist between predictors, it becomes difficult for the regression model to distinguish the unique contributions of each variable, potentially compromising the accuracy and reliability of the model.

Identifying and addressing multicollinearity is critical for refining the integrity of our regression analysis. It allows us to disentangle the interdependencies between predictor variables and ensures that each variable’s impact on the response variable is accurately reflected. By navigating the complexities of multicollinearity, we aim to enhance the robustness of our regression model, providing more accurate and trustworthy insights into the relationships between predictors and the target variable.
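The degree of multicollinearity can be quantified with variance inflation factors (VIF). Below is a minimal base-R sketch on synthetic collinear data; in our setting, the numeric loan predictors would take the place of x1-x3:

```r
# VIF by hand: regress each predictor on the others; VIF = 1 / (1 - R^2).
# Values well above 1 signal collinearity (x2 is built to mirror x1).
set.seed(7)
x1 <- rnorm(300)
x2 <- x1 + rnorm(300, sd = 0.3)   # strongly collinear with x1
x3 <- rnorm(300)                  # independent predictor
d <- data.frame(x1, x2, x3)
vif <- sapply(names(d), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(d), v), v), data = d))$r.squared
  1 / (1 - r2)
})
round(vif, 2)   # x1 and x2 inflated, x3 near 1
```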

1.7: Visualize the correlations that emerge between the numerical features. Discuss the results. Which variables are highly correlated? Decide whether you keep all variables.

correlations <- cor(loan_sample_under[-c(3, 4, 6, 7, 14, 17)])
corrplot(correlations)

In addition to exploring correlations, we delve deeper into their statistical significance. Utilizing the cor_pmat function, we gain the ability to explicitly assess the significance of each correlation, with a distinct focus on highlighting the insignificant ones.

This method goes beyond merely identifying correlations; it allows us to scrutinize the strength and reliability of each correlation coefficient. By explicitly marking correlations that are deemed statistically insignificant, we enhance our ability to distinguish between robust relationships and those that may occur by chance.

The cor_pmat function becomes a valuable tool in this context, offering a visual representation that guides us in pinpointing correlations that warrant closer scrutiny. This nuanced approach not only emphasizes the presence of correlations but also refines our understanding by discerning the statistical significance associated with each, providing a more comprehensive evaluation of the interplay between variables.

p_value_mat <- cor_pmat(loan_sample_under[,-c(3,4,6,7,14,17)])
ggcorrplot(correlations, type = "lower", p.mat = p_value_mat) 

When we look at how variables in our dataset relate to each other, those with a high correlation, close to 1 or -1, are seen as strongly linked. It suggests that knowing one variable might pretty much tell us about the other, hinting at a kind of redundant or closely connected relationship.

Conversely, when the correlation is closer to 0, it means the relationship is weaker. It could be moderate or just a faint connection between variables. This indicates that understanding one variable might not necessarily give us a good prediction for another.

In our analysis, we checked correlation plots and got insights from the Boruta algorithm. Interestingly, despite seeing varying strengths in correlations, we’ve decided to keep all the variables. The reason behind this choice is that even if the relationships are not super strong, each variable brings in unique and valuable information.

The idea here is to appreciate the mix of both strong and not-so-strong connections between variables. Keeping all the variables allows us to have a more complete understanding of the dataset, considering the nuances and diverse information each variable brings to the table.

1.8: Plot an interactive scatter plot of the association between the loan amount requested and the annual income of the borrower. Discuss the plot. What can you tell about the association?

library(plotly)

# The relevant columns are loan_amnt and annual_inc in loan_sample_under
plot_ly(data = loan_sample_under, x = ~loan_amnt, y = ~annual_inc, 
        type = 'scatter', mode = 'markers', 
        marker = list(size = 10, opacity = 0.5)) %>%
  layout(title = 'Association between Loan Amount and Annual Income',
         xaxis = list(title = 'Loan Amount Requested'),
         yaxis = list(title = 'Annual Income of Borrower'))

As we examine the scatter plot, the regions of high point density reveal common combinations of income and loan amounts, offering insights into potential standard loan products or characteristic borrower profiles. The clustering of data points in these areas indicates prevalent patterns that could be indicative of well-established loan products catering to specific income brackets or borrower demographics.

Up to a salary of 20k, people tend to apply for a maximum loan of 10k. At a salary of around 40k, the requested loan amount increases to about 20k, and at 60k it is only around 25k. This suggests that loan amounts do not grow in direct proportion to income; rather, the increase follows a diminishing trend, with borrowers requesting relatively larger loans at lower incomes and relatively smaller ones at higher incomes.

Exercise 2: Train and test a logistic classifier.

We divided the data into a 70/30 split, allocating 70% of the data for training purposes and reserving the remaining 30% for testing. This approach ensures that our model is trained on a substantial portion of the dataset, allowing it to learn patterns and relationships. The logistic classifier serves as our chosen model, given our specific goal of predicting the loan outcome indicated by the 'Status' column.

The training phase involves exposing the model to the majority of the data, enabling it to grasp the underlying patterns and relationships. Subsequently, we assess its performance on the reserved testing set, which represents new, unseen data. This evaluation provides valuable insights into how well the model generalizes to real-world scenarios.

By leveraging this 70/30 split and utilizing a logistic classifier, our aim is to develop a robust and accurate model capable of making reliable predictions regarding loan outcomes based on the provided input features. This process aligns with best practices in machine learning, striking a balance between training and evaluation to ensure the model's effectiveness in real-world applications.

2.1: Dividing the sample into training and testing set

# Set seed for reproducibility
set.seed(7)

# Split the data into training (70%) and testing (30%) sets
splitIndex <- createDataPartition(loan_sample_under$Status, p = 0.7, list = FALSE)
training_set <- loan_sample_under[splitIndex,]
testing_set <- loan_sample_under[-splitIndex,]

PercTable(loan_sample_under$Status)
##                
##     freq   perc
##                
## 0  5'191  50.0%
## 1  5'190  50.0%
PercTable(training_set$Status)
##                
##     freq   perc
##                
## 0  3'634  50.0%
## 1  3'633  50.0%
PercTable(testing_set$Status)
##                
##     freq   perc
##                
## 0  1'557  50.0%
## 1  1'557  50.0%

It can be seen that positive and negative status values have been fairly distributed across the training and testing datasets.

2.2: Train the classifier and report the coefficients obtained and interpret the results.

In the next step, we train the logit model. In terms of our inputs, i.e. our Xs, we use all variables included in loan_sample_under apart from Status, which is our Y; as the Call line below shows, the model is fitted with `glm(Status ~ ., data = training_set, family = binomial())`. How would you interpret the results printed from the summary of fit1?

## 
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -2.502e+00  3.188e-01  -7.846 4.29e-15 ***
## loan_amnt                           6.021e-05  6.042e-06   9.965  < 2e-16 ***
## int_rate                            1.124e-01  1.531e-02   7.341 2.12e-13 ***
## gradeB                              3.985e-01  1.049e-01   3.798 0.000146 ***
## gradeC                              5.648e-01  1.367e-01   4.133 3.58e-05 ***
## gradeD                              6.147e-01  2.014e-01   3.053 0.002268 ** 
## home_ownershipOWN                   6.345e-03  9.334e-02   0.068 0.945806    
## home_ownershipRENT                  2.215e-01  6.796e-02   3.259 0.001119 ** 
## annual_inc                         -5.566e-06  1.183e-06  -4.706 2.52e-06 ***
## verification_statusSource Verified  1.267e-01  5.950e-02   2.130 0.033164 *  
## verification_statusVerified         7.239e-02  6.820e-02   1.061 0.288522    
## purposecredit_card                 -5.311e-02  2.425e-01  -0.219 0.826621    
## purposedebt_consolidation          -3.014e-02  2.385e-01  -0.126 0.899433    
## purposehome_improvement             1.459e-01  2.608e-01   0.560 0.575775    
## purposehouse                       -7.137e-01  4.187e-01  -1.705 0.088253 .  
## purposemajor_purchase              -1.038e-02  2.969e-01  -0.035 0.972113    
## purposemedical                      9.694e-02  3.387e-01   0.286 0.774748    
## purposemoving                      -6.495e-01  3.769e-01  -1.723 0.084845 .  
## purposeother                       -1.591e-01  2.552e-01  -0.623 0.533091    
## purposerenewable_energy             9.113e-01  9.181e-01   0.993 0.320917    
## purposesmall_business               3.667e-01  3.500e-01   1.048 0.294720    
## purposevacation                     8.011e-02  3.429e-01   0.234 0.815283    
## purposewedding                     -1.836e+00  8.557e-01  -2.145 0.031959 *  
## dti                                 1.412e-02  3.644e-03   3.875 0.000107 ***
## open_acc                            4.295e-02  9.351e-03   4.593 4.36e-06 ***
## revol_bal                          -1.064e-05  7.926e-06  -1.342 0.179469    
## revol_util                          1.363e-03  1.988e-03   0.686 0.492961    
## total_acc                          -8.441e-03  3.917e-03  -2.155 0.031175 *  
## total_rec_int                      -2.410e-04  2.285e-05 -10.545  < 2e-16 ***
## application_typeJoint App           1.403e-01  2.058e-01   0.682 0.495248    
## tot_cur_bal                        -7.800e-07  3.778e-07  -2.065 0.038950 *  
## total_rev_hi_lim                    1.483e-06  4.517e-06   0.328 0.742687    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10074.2  on 7266  degrees of freedom
## Residual deviance:  9186.6  on 7235  degrees of freedom
## AIC: 9250.6
## 
## Number of Fisher Scoring iterations: 4

A positive coefficient, such as the one on the interest rate, means that higher values of that variable increase the log-odds of Status being '1'. Other factors like grade, home ownership, annual income, and total interest received also show highly significant effects.

The intercept of about -2.5 gives the baseline log-odds when all predictors are at zero or at their reference levels. Lower p-values indicate stronger statistical evidence for a variable's effect, not necessarily a larger effect size.

##  [1] "loan_amnt"                          "int_rate"                          
##  [3] "gradeB"                             "gradeC"                            
##  [5] "gradeD"                             "home_ownershipRENT"                
##  [7] "annual_inc"                         "verification_statusSource Verified"
##  [9] "purposewedding"                     "dti"                               
## [11] "open_acc"                           "total_acc"                         
## [13] "total_rec_int"                      "tot_cur_bal"

We can print out only the significant variables with p-value lower than 0.05. We notice that 14 variables are found statistically significant.
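The list above can be produced by filtering the coefficient table of the fitted model. The sketch below demonstrates that logic on synthetic data (standing in for fit1 and its training set, which are not re-created in this chunk):

```r
# Sketch: extract names of coefficients with p < 0.05, excluding the intercept.
# Synthetic data stands in for the loan sample; the same filtering would be
# applied to summary(fit1)$coefficients in the paper.
set.seed(7)
d <- data.frame(y = rbinom(500, 1, 0.5))
d$x1 <- d$y + rnorm(500, sd = 0.5)   # informative predictor
d$x2 <- rnorm(500)                   # pure noise predictor
m <- glm(y ~ x1 + x2, data = d, family = binomial())
coefs <- summary(m)$coefficients
sig <- setdiff(rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05], "(Intercept)")
print(sig)
```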

2.3: Plot the ROC and the Precision/Recall Curve and interpret the results.

Next, we aim to evaluate the predictive performance of our model. To do so, we will plot the ROC curve.

testing_set$fit1_score <- predict(fit1, newdata = testing_set, type = "response")
fit1_pred <- prediction(testing_set$fit1_score, testing_set$Status)
fit1_roc <- performance(fit1_pred, "tpr", "fpr")
plot(fit1_roc, lwd=1, colorize = TRUE, main = "Fit1: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)

In the subsequent step of our analysis, we delve into the intricacies of our model by visualizing the Precision/Recall Curve. This curve serves as a comprehensive illustration of the delicate balance between the true positive rate and the positive predictive value across various probability thresholds.

As we navigate through different probability thresholds, the Precision/Recall Curve provides a nuanced perspective on the model’s performance. It vividly captures the trade-offs inherent in our predictive model, offering valuable insights into how adjusting the probability threshold impacts both precision and recall.

Precision, representing the accuracy of positive predictions, and recall, measuring the model’s ability to capture all actual positive instances, are pivotal metrics in evaluating classifier performance. By exploring this curve, we gain a deeper understanding of the model’s behavior across a spectrum of probability thresholds, enabling us to make informed decisions about the optimal threshold for our specific use case.

This visual representation not only aids in model evaluation but also guides us in fine-tuning our predictive approach for achieving the desired balance between precision and recall, aligning with the specific objectives and requirements of our analysis.

We observed that our ROC curve is not ideal. While it lies above the diagonal, ideally it should separate from the diagonal more clearly and rise more steeply at the beginning. This does not mean the analysis is wrong; rather, the model could be trained further or a richer dataset might be needed.

fit1_precision <- performance(fit1_pred, measure = "prec", x.measure = "rec")
plot(fit1_precision, main="Fit1: Logit - Precision vs Recall")

The curve starts with a high precision at low recall, which could indicate that the model is initially very selective and only classifies as positive those cases for which it is very certain. This leads to high precision, but not many actual positive cases are detected. With increasing recall, i.e. when the model tries to capture more positive cases, the precision decreases. This shows the typical trade-off relationship between precision and recall: if you try to detect more positive cases (increase the recall), the precision usually decreases because the model then also misclassifies more cases as positive.

An ideal model would have a curve near the upper right corner of the graph, which means that it can achieve both high precision and high recall.
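This trade-off is easy to reproduce numerically. The following is a small synthetic sketch (illustrative scores, not our fit1 predictions):

```r
# Precision/recall at two thresholds on illustrative scores: raising the
# threshold makes the classifier more selective (precision up, recall down).
set.seed(7)
truth <- rbinom(1000, 1, 0.5)
score <- truth * 0.3 + runif(1000)          # noisy score, higher for positives
pr_at <- function(t) {
  pred <- as.integer(score >= t)
  c(precision = sum(pred & truth) / sum(pred),
    recall    = sum(pred & truth) / sum(truth))
}
rbind(t0.4 = pr_at(0.4), t0.8 = pr_at(0.8))
```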

2.4: Produce the confusion matrix and interpret the results.

# Extract the confusion matrix 
cm <- confusionMatrix(as.factor(round(testing_set$fit1_score)), testing_set$Status)
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 997 561
##          1 560 996
##                                           
##                Accuracy : 0.64            
##                  95% CI : (0.6229, 0.6569)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.28            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.6403          
##             Specificity : 0.6397          
##          Pos Pred Value : 0.6399          
##          Neg Pred Value : 0.6401          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3202          
##    Detection Prevalence : 0.5003          
##       Balanced Accuracy : 0.6400          
##                                           
##        'Positive' Class : 0               
## 
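The headline rates can be verified directly from the four counts in the matrix, remembering that the positive class here is '0':

```r
# Recompute the reported rates from the raw counts; positive class = "0"
tab <- matrix(c(997, 560, 561, 996), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))
sensitivity <- tab["0", "0"] / sum(tab[, "0"])   # 997 / 1557
specificity <- tab["1", "1"] / sum(tab[, "1"])   # 996 / 1557
accuracy    <- sum(diag(tab)) / sum(tab)
round(c(sensitivity = sensitivity,
        specificity = specificity,
        accuracy = accuracy), 4)
```

Roughly 36% of both classes are misclassified, which is consistent with the balanced accuracy of 0.64 reported above.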

2.5 Report the AUC values and the overall accuracy and interpret the results.

# AUC
fit1_auc <- performance(fit1_pred, measure = "auc")

# Overall accuracy
accuracy <- sum(diag(cm$table)) / sum(cm$table)

# Print of the values
cat("AUC: ", fit1_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy)
## AUC:  69.61645 
## Overall Accuracy:  0.6400128

The AUC, or “Area Under the Curve,” is a measure of the performance of a classification model, and a value of 69.61645 corresponds to approximately 69.62%. This value indicates how well the model can distinguish between classes; a value of 100% would mean perfect discrimination, while 50% indicates no better than random chance – akin to flipping a coin. Although an AUC of around 69.62% suggests that the model is better than random, it is often insufficient in many applications, especially in critical areas like credit scoring. In credit scoring, higher standards of predictive accuracy are demanded because misclassifications can lead to financial losses or unjustified credit denial.
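The AUC also has a useful probabilistic reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A base-R illustration on synthetic scores (not our model's predictions):

```r
# AUC via the rank-sum identity on illustrative scores:
# AUC = P(score_pos > score_neg), ties negligible for continuous scores.
set.seed(7)
scores_pos <- rnorm(200, mean = 1)   # scores for the positive class
scores_neg <- rnorm(200, mean = 0)   # scores for the negative class
auc <- mean(outer(scores_pos, scores_neg, ">"))
round(auc, 3)
```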

The overall accuracy of the model is 64.00128%, which is only marginally better than making a random choice. An accuracy rate of about 64% is frequently inadequate for decisions where the reliability of the prediction is of great importance, such as determining an individual’s creditworthiness. With such an accuracy rate, the risk of incorrectly classifying individuals as creditworthy or not is relatively high. Practically, this would mean that nearly four out of ten credit decisions might be incorrect, which could have significant adverse effects for both lenders and potential borrowers. Therefore, it would be crucial to further refine and improve the model to achieve higher accuracy before employing it in a real-world credit lending process.

Exercise 3

Thinking about the pre-processing steps that you carried out before training the logistic classifier: Can you think of a way to improve the predictive performance of your model?

3.1.1: Training the logistic classifier without any preprocessing or feature selection steps

# Set seed for reproducibility
set.seed(7)

# Splitting the data into testing and training data
splitIndex2 <- createDataPartition(loan_sample$Status, p = 0.7, list = FALSE)
training_set2 <- loan_sample[splitIndex2,]
testing_set2 <- loan_sample[-splitIndex2,]

# Percentage table of Status without preprocessing
PercTable(loan_sample$Status)
##                
##     freq   perc
##                
## 0 34'810  87.0%
## 1  5'190  13.0%
PercTable(training_set2$Status)
##                
##     freq   perc
##                
## 0 24'367  87.0%
## 1  3'633  13.0%
PercTable(testing_set2$Status)
##                
##     freq   perc
##                
## 0 10'443  87.0%
## 1  1'557  13.0%
# Fitting the model
fit2 <- glm(Status ~ ., data=training_set2,family=binomial())
summary(fit2)
## 
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set2)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -4.312e+00  2.390e-01 -18.041  < 2e-16 ***
## loan_amnt                           6.705e-05  4.327e-06  15.496  < 2e-16 ***
## int_rate                            1.032e-01  9.869e-03  10.452  < 2e-16 ***
## gradeB                              3.546e-01  8.463e-02   4.190 2.79e-05 ***
## gradeC                              6.056e-01  1.012e-01   5.984 2.18e-09 ***
## gradeD                              6.676e-01  1.414e-01   4.721 2.35e-06 ***
## home_ownershipOWN                   6.147e-02  6.938e-02   0.886 0.375618    
## home_ownershipRENT                  2.356e-01  5.079e-02   4.638 3.51e-06 ***
## annual_inc                         -5.235e-06  8.955e-07  -5.846 5.04e-09 ***
## verification_statusSource Verified  7.956e-02  4.464e-02   1.782 0.074712 .  
## verification_statusVerified         7.838e-02  5.021e-02   1.561 0.118519    
## purposecredit_card                 -6.694e-02  1.891e-01  -0.354 0.723338    
## purposedebt_consolidation          -6.613e-02  1.859e-01  -0.356 0.722028    
## purposehome_improvement             1.180e-01  2.022e-01   0.584 0.559546    
## purposehouse                       -2.357e-01  3.278e-01  -0.719 0.472001    
## purposemajor_purchase               5.665e-03  2.253e-01   0.025 0.979936    
## purposemedical                      2.148e-01  2.394e-01   0.897 0.369624    
## purposemoving                      -7.206e-01  3.088e-01  -2.334 0.019604 *  
## purposeother                        1.113e-02  1.971e-01   0.056 0.954971    
## purposerenewable_energy             5.033e-01  5.596e-01   0.899 0.368417    
## purposesmall_business               2.876e-01  2.463e-01   1.168 0.242874    
## purposevacation                     3.334e-01  2.538e-01   1.313 0.189048    
## purposewedding                     -8.367e-02  5.370e-01  -0.156 0.876169    
## dti                                 1.374e-02  2.605e-03   5.273 1.34e-07 ***
## open_acc                            3.992e-02  6.667e-03   5.989 2.11e-09 ***
## revol_bal                          -6.500e-08  5.694e-06  -0.011 0.990891    
## revol_util                          1.746e-03  1.437e-03   1.215 0.224529    
## total_acc                          -1.007e-02  2.862e-03  -3.519 0.000433 ***
## total_rec_int                      -2.345e-04  1.591e-05 -14.742  < 2e-16 ***
## application_typeJoint App           8.834e-02  1.401e-01   0.631 0.528206    
## tot_cur_bal                        -5.014e-07  2.871e-07  -1.747 0.080667 .  
## total_rev_hi_lim                   -6.259e-06  3.384e-06  -1.850 0.064351 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21611  on 27999  degrees of freedom
## Residual deviance: 19941  on 27968  degrees of freedom
## AIC: 20005
## 
## Number of Fisher Scoring iterations: 5
# Print out only the significant variables with p-value lower than 0.05.
significant.variables2 <- summary(fit2)$coeff[-1,4] < 0.05
names(significant.variables2)[significant.variables2 == TRUE]
##  [1] "loan_amnt"          "int_rate"           "gradeB"            
##  [4] "gradeC"             "gradeD"             "home_ownershipRENT"
##  [7] "annual_inc"         "purposemoving"      "dti"               
## [10] "open_acc"           "total_acc"          "total_rec_int"
testing_set2$fit2_score <- predict(fit2,type='response',testing_set2)
fit2_pred <- prediction(testing_set2$fit2_score, testing_set2$Status)
fit2_roc <- performance(fit2_pred, "tpr", "fpr")
plot(fit2_roc, lwd=1, colorize = TRUE, main = "Fit2: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)

fit2_precision <- performance(fit2_pred, measure = "prec", x.measure = "rec")
plot(fit2_precision, main="Fit2: Logit - Precision vs Recall")

# Extract the confusion matrix 
cm2 <- confusionMatrix(as.factor(round(testing_set2$fit2_score)), testing_set2$Status)
print(cm2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10414  1534
##          1    29    23
##                                           
##                Accuracy : 0.8698          
##                  95% CI : (0.8636, 0.8757)
##     No Information Rate : 0.8702          
##     P-Value [Acc > NIR] : 0.5714          
##                                           
##                   Kappa : 0.0204          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99722         
##             Specificity : 0.01477         
##          Pos Pred Value : 0.87161         
##          Neg Pred Value : 0.44231         
##              Prevalence : 0.87025         
##          Detection Rate : 0.86783         
##    Detection Prevalence : 0.99567         
##       Balanced Accuracy : 0.50600         
##                                           
##        'Positive' Class : 0               
## 
# AUC
fit2_auc <- performance(fit2_pred, measure = "auc")

# Overall accuracy
accuracy2 <- sum(diag(cm2$table)) / sum(cm2$table)

# Print of the values
cat("AUC: ", fit2_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy2)
## AUC:  69.35202 
## Overall Accuracy:  0.86975

At first glance, the model trained without any preprocessing reaches a much higher accuracy of 86.98%. This figure is misleading, however: since about 87% of the observations are non-defaults, the model achieves it by almost always predicting class 0, as the confusion matrix shows (a specificity of only 1.48%, a Kappa of 0.0204, and an accuracy that does not beat the no-information rate of 87.02%). The AUC of 69.35% is in fact slightly lower than before, so removing outliers and balancing the dataset did not harm the model’s ability to discriminate between classes. What balancing did cost us is data: undersampling discarded roughly 75% of the original observations, leaving us with far less data on which to train the model.
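Rather than discarding majority-class rows, one alternative worth trying is to keep the full dataset and re-weight the minority class in the likelihood. The sketch below illustrates the idea on a small synthetic data frame (`df` and `x` are made-up stand-ins, not columns of `loan_sample`); `glm()` warns about non-integer weights with a binomial family, which is expected here.

```r
# Minimal sketch: class weighting as an alternative to undersampling.
# The minority class (Status == 1) is up-weighted so that both classes
# contribute equally to the fit, without throwing any rows away.
set.seed(7)
df <- data.frame(
  Status = rep(c(0, 1), times = c(870, 130)),  # ~87% / 13%, as in loan_sample
  x      = rnorm(1000)
)
w <- ifelse(df$Status == 1, sum(df$Status == 0) / sum(df$Status == 1), 1)
fit_w <- suppressWarnings(
  glm(Status ~ x, data = df, family = binomial(), weights = w)
)
coef(fit_w)
```

On the real data the same `weights` argument could be passed to the `glm()` call for `fit2`, keeping all 40,000 rows while still treating both classes symmetrically.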

3.1.2: Training the logistic classifier without the column application_type

# Set seed for reproducibility
set.seed(7)

# Deleting column application_type
loan_sample_under_new <- loan_sample_under[-14]

# Splitting the data into testing and training data
splitIndex3 <- createDataPartition(loan_sample_under_new$Status, p = 0.7, list = FALSE)
training_set3 <- loan_sample_under_new[splitIndex3,]
testing_set3 <- loan_sample_under_new[-splitIndex3,]

# Percentage table of Status for the balanced data without application_type
PercTable(loan_sample_under_new$Status)
##                
##     freq   perc
##                
## 0  5'191  50.0%
## 1  5'190  50.0%
PercTable(training_set3$Status)
##                
##     freq   perc
##                
## 0  3'634  50.0%
## 1  3'633  50.0%
PercTable(testing_set3$Status)
##                
##     freq   perc
##                
## 0  1'557  50.0%
## 1  1'557  50.0%
# Fitting the model
fit3 <- glm(Status ~ ., data=training_set3,family=binomial())
summary(fit3)
## 
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set3)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -2.503e+00  3.188e-01  -7.852 4.09e-15 ***
## loan_amnt                           6.064e-05  6.010e-06  10.090  < 2e-16 ***
## int_rate                            1.127e-01  1.530e-02   7.364 1.78e-13 ***
## gradeB                              3.981e-01  1.049e-01   3.794 0.000148 ***
## gradeC                              5.639e-01  1.367e-01   4.127 3.68e-05 ***
## gradeD                              6.123e-01  2.014e-01   3.041 0.002358 ** 
## home_ownershipOWN                   5.751e-03  9.334e-02   0.062 0.950871    
## home_ownershipRENT                  2.216e-01  6.796e-02   3.260 0.001114 ** 
## annual_inc                         -5.612e-06  1.181e-06  -4.752 2.01e-06 ***
## verification_statusSource Verified  1.261e-01  5.949e-02   2.120 0.034004 *  
## verification_statusVerified         7.273e-02  6.820e-02   1.067 0.286189    
## purposecredit_card                 -5.463e-02  2.424e-01  -0.225 0.821709    
## purposedebt_consolidation          -3.023e-02  2.385e-01  -0.127 0.899133    
## purposehome_improvement             1.451e-01  2.608e-01   0.557 0.577858    
## purposehouse                       -7.130e-01  4.189e-01  -1.702 0.088708 .  
## purposemajor_purchase              -1.169e-02  2.968e-01  -0.039 0.968574    
## purposemedical                      1.040e-01  3.384e-01   0.307 0.758693    
## purposemoving                      -6.480e-01  3.768e-01  -1.720 0.085432 .  
## purposeother                       -1.589e-01  2.552e-01  -0.623 0.533444    
## purposerenewable_energy             9.099e-01  9.179e-01   0.991 0.321555    
## purposesmall_business               3.653e-01  3.499e-01   1.044 0.296470    
## purposevacation                     7.889e-02  3.429e-01   0.230 0.818059    
## purposewedding                     -1.838e+00  8.560e-01  -2.147 0.031786 *  
## dti                                 1.429e-02  3.635e-03   3.931 8.45e-05 ***
## open_acc                            4.268e-02  9.342e-03   4.568 4.91e-06 ***
## revol_bal                          -1.072e-05  7.925e-06  -1.353 0.176182    
## revol_util                          1.342e-03  1.988e-03   0.675 0.499482    
## total_acc                          -8.411e-03  3.917e-03  -2.147 0.031764 *  
## total_rec_int                      -2.420e-04  2.280e-05 -10.615  < 2e-16 ***
## tot_cur_bal                        -7.642e-07  3.770e-07  -2.027 0.042670 *  
## total_rev_hi_lim                    1.499e-06  4.516e-06   0.332 0.739938    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10074.2  on 7266  degrees of freedom
## Residual deviance:  9187.1  on 7236  degrees of freedom
## AIC: 9249.1
## 
## Number of Fisher Scoring iterations: 4
# Print out only the significant variables with p-value lower than 0.05.
significant.variables3 <- summary(fit3)$coeff[-1,4] < 0.05
names(significant.variables3)[significant.variables3 == TRUE]
##  [1] "loan_amnt"                          "int_rate"                          
##  [3] "gradeB"                             "gradeC"                            
##  [5] "gradeD"                             "home_ownershipRENT"                
##  [7] "annual_inc"                         "verification_statusSource Verified"
##  [9] "purposewedding"                     "dti"                               
## [11] "open_acc"                           "total_acc"                         
## [13] "total_rec_int"                      "tot_cur_bal"
testing_set3$fit3_score <- predict(fit3,type='response',testing_set3)
fit3_pred <- prediction(testing_set3$fit3_score, testing_set3$Status)
fit3_roc <- performance(fit3_pred, "tpr", "fpr")
plot(fit3_roc, lwd=1, colorize = TRUE, main = "Fit3: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)

fit3_precision <- performance(fit3_pred, measure = "prec", x.measure = "rec")
plot(fit3_precision, main="Fit3: Logit - Precision vs Recall")

# Extract the confusion matrix 
cm3 <- confusionMatrix(as.factor(round(testing_set3$fit3_score)), testing_set3$Status)
print(cm3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1001  560
##          1  556  997
##                                           
##                Accuracy : 0.6416          
##                  95% CI : (0.6245, 0.6585)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.2832          
##                                           
##  Mcnemar's Test P-Value : 0.9284          
##                                           
##             Sensitivity : 0.6429          
##             Specificity : 0.6403          
##          Pos Pred Value : 0.6413          
##          Neg Pred Value : 0.6420          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3215          
##    Detection Prevalence : 0.5013          
##       Balanced Accuracy : 0.6416          
##                                           
##        'Positive' Class : 0               
## 
# AUC
fit3_auc <- performance(fit3_pred, measure = "auc")

# Overall accuracy
accuracy3 <- sum(diag(cm3$table)) / sum(cm3$table)

# Print of the values
cat("AUC: ", fit3_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy3)
## AUC:  69.62428 
## Overall Accuracy:  0.6416185

By removing the ‘application_type’ feature, we achieved a slight improvement in our model’s accuracy (64.16% versus 64.00% before). However, the accuracy is still far from sufficient to deploy the model in a loan approval process.

3.2: What can you do differently? (hint: Feel free to be creative and discuss any additional step in data collection and/or data pre-processing that you might try so to improve the results)

Historical Data:

Historical data is the backbone of predictive modeling in credit scoring. It encompasses records of past financial behavior, including loan applications, repayment histories, defaults, and recoveries. Analyzing this data helps in understanding trends and identifying patterns that could predict future loan performance. For example, by examining the characteristics of loans that have defaulted in the past, one can often find indicators that may signal a potential future default. Moreover, historical data allows for the backtesting of models to ensure they perform well on unseen data before deployment.
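Such a backtest can be set up by splitting on time rather than at random: fit on older loans and evaluate on the most recent ones. The sketch below uses simulated data with a hypothetical `issue_year` column, which is not part of our dataset, purely to illustrate the mechanics.

```r
# Minimal sketch of a temporal backtest: train on older loans, score newer ones.
# `issue_year`, `x`, and this data frame are simulated stand-ins.
set.seed(7)
loans <- data.frame(
  issue_year = sample(2015:2019, 300, replace = TRUE),
  x          = rnorm(300)
)
loans$Status <- rbinom(300, 1, plogis(loans$x))

train <- subset(loans, issue_year <  2019)  # fit on the past
test  <- subset(loans, issue_year == 2019)  # evaluate on the "future"

fit_bt <- glm(Status ~ x, data = train, family = binomial())
mean(round(predict(fit_bt, test, type = "response")) == test$Status)
```

Unlike a random split, this setup mimics deployment conditions: the model never sees information from the period it is asked to predict.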

Broader Data Sources:

To improve the generalizability of a model – its ability to perform well across various populations and economic conditions – incorporating broader data sources is key. This means looking beyond the traditional credit reports and including alternative data such as rent payments, utility bill history, and even non-financial personal data. The premise is that the more diverse the data, the less the model relies on patterns specific to a particular dataset, which can reduce overfitting. Overfitting occurs when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. By training on a more comprehensive set of data points, the model can better capture the complexities of human financial behavior and thus generalize better when applied to the broader population.

Customer Data:

Incorporating detailed customer data can further refine a model’s predictions. This data can include information provided by the customer during the application process, like employment stability, income level, education, and even lifestyle choices. With the customer’s consent, more granular data such as transaction histories or savings patterns can be used to construct a more nuanced profile. For example, consistent saving patterns might indicate financial stability, which could be a positive signal in credit scoring.
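As a small illustration of how such granular data could be turned into a model feature, one might summarize each customer’s monthly savings history into a single consistency score. The snippet below is a toy sketch; the monthly savings matrix is simulated and such a column does not exist in our dataset.

```r
# Toy sketch: a "savings consistency" feature from hypothetical monthly
# savings histories (5 customers x 12 months, simulated).
set.seed(7)
savings <- matrix(rpois(5 * 12, lambda = 200), nrow = 5)

# Coefficient of variation per customer: lower values = steadier saving,
# which the text suggests could be a positive signal in credit scoring.
consistency <- apply(savings, 1, function(m) sd(m) / mean(m))
consistency
```

A derived column like this could then be added to the training data alongside the existing predictors.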

Feature selection:

Feature selection is an iterative process, and it’s crucial to strike a balance between model complexity and predictive accuracy. Regularly validating and updating the feature set based on ongoing analysis and performance monitoring is essential for maintaining a robust credit scoring model.
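One concrete way to iterate on the feature set is AIC-based stepwise selection with base R’s `step()`. The sketch below uses the built-in `mtcars` data as a stand-in for the loan data, with the binary `am` column playing the role of `Status`; on the real data one would start from `fit3` instead.

```r
# Minimal sketch: backward stepwise selection by AIC with base R's step().
# mtcars$am is a stand-in binary outcome; mpg, wt, hp, and disp are the
# candidate features. step() drops predictors whose removal lowers the AIC.
full <- suppressWarnings(
  glm(am ~ mpg + wt + hp + disp, data = mtcars, family = binomial())
)
reduced <- suppressWarnings(step(full, direction = "backward", trace = 0))

# The reduced model keeps only the predictors that earn their AIC cost.
formula(reduced)
```

This automates one round of the iterative process described above; the pruned feature set would then be re-validated on held-out data.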

Exercise 4

Finally, thinking about putting your model into action and basing credit decisions on the predictions that it generates: What kind of challenges may a company face if it were to use your model in its daily business, in particular with regard to the ethical challenges and moral obligations companies have? Please refer to the “common ethical issues in the context of the creation of value from data” (see slides, week 11) in your answer.

Data Security and Privacy: When using customer data, it is crucial to ensure that this information is stored securely and in accordance with the data protection regulations of the respective economic region, and that measures are implemented to protect against security vulnerabilities. In Switzerland, for instance, this would likely fall under the supervision of FINMA. The release of customer data, whether due to hacking incidents, other errors, or system malfunctions, could have a significantly adverse impact on customers.

Algorithmic Bias & Fairness: Models may inadvertently incorporate biases present in the training data, leading to unfair treatment of certain groups. This is particularly sensitive in credit scoring, which can affect an individual’s financial opportunities.

Accountability & Responsibility: Companies must be able to account for the decisions made by their models. This means having clear processes in place for when things go wrong or when the model’s decisions have a significant impact on an individual’s life, such as being denied a loan.

Impact on the Workforce: The introduction of AI and machine learning models into business processes can disrupt job markets, potentially leading to job displacement. Companies have a moral obligation to consider how their use of technology will impact their workforce and to take steps to mitigate negative impacts.

Can you think of a way how companies can overcome or at least mitigate the issues that you described above?

Enhance Transparency: By being open about how their models work and the data they use, companies can build trust with their clients and stakeholders.

Implement Bias Auditing Procedures: Regularly test and update models to ensure that they are free from bias and make corrections where necessary.

Establish Governance Frameworks: Create robust data governance frameworks that outline how data is to be used, managed, and protected.
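A bias audit can start very simply: compare the model’s approval rates across a protected attribute and flag large gaps. The snippet below is a toy sketch with simulated data (`group` and `approve` are hypothetical columns, not part of our dataset); a real audit would add statistical tests and dedicated fairness tooling.

```r
# Toy sketch of a disparate-impact check: approval rates per group.
# The audit data here is entirely simulated.
set.seed(7)
audit <- data.frame(
  group   = sample(c("A", "B"), 500, replace = TRUE),
  approve = rbinom(500, 1, 0.5)
)
rates <- tapply(audit$approve, audit$group, mean)
gap   <- abs(diff(rates))        # a large gap would warrant investigation

# "Four-fifths" rule of thumb: the lower approval rate should be at
# least 80% of the higher one.
ratio <- min(rates) / max(rates)
c(gap = unname(gap), ratio = ratio)
```

Run regularly on the model’s actual decisions, a check like this would make drifting disparities visible before they cause harm.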

Develop Ethical Guidelines: Create and adhere to a set of ethical guidelines that dictate how AI and data analytics are to be used in the business.

Invest in Employee Training: Invest in training for employees to ensure that they understand the ethical implications of their work and the technology they use.

Engage with Stakeholders: Include stakeholders in the discussion about how AI is used in the business and the impacts it may have on various groups.

By addressing these challenges proactively, companies can not only ensure they are meeting their ethical obligations but also build a foundation of trust with customers and the wider public.

Conclusion: In conclusion, we are not entirely satisfied with how our model performs, as it is still too inaccurate. We discussed this issue in our group and believe that training on other, broader datasets could improve accuracy and help us make the model work properly.